Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove RESTARTING/PURGING states #3905

Merged
merged 4 commits into from
May 27, 2024

Conversation

christoph-zededa
Copy link
Contributor

@christoph-zededa christoph-zededa commented May 6, 2024

if an edge app upgrade is done while an EVE update is done,
the app status state stays in RESTARTING although EVE
will restart and therefore halt the app instance; therefore
the app status should go to HALT in this case.

@OhmSpectator
Copy link
Member

Could you please explain why you decided to split the inactivation function?

Also, please test the snapshot functionality. This code path could be critical for the volume snap creation as we wait for the app to be deactivated during the purge. I can share some details in DM on how to perform it.

@OhmSpectator
Copy link
Member

The places in the code I'm worrying about:

  1. domainStatus := lookupDomainStatus(ctx, uuidStr)
  2. if !uninstall && domainStatus != nil && !domainStatus.Activated {

    And here is why:
  3. 4. **Waiting for App Deactivation**: After the above steps, EVE is ready to
  4. 8. **Waiting for App Deactivation**: After marking the application for purge,
  5. 5. **Wait For Deferred Deactivation**

TL;DR: The code of the deactivation detection during the snap operations is pretty tricky, and I figured out the places where the wait should happen empirically. So it's pretty fragile.

I'll run the tests for snap manually.

@OhmSpectator
Copy link
Member

We ran a simple Snap/Rollback/Delete test on an ext4-based app, and it looks working.

@christoph-zededa christoph-zededa marked this pull request as ready for review May 6, 2024 15:30
@rouming
Copy link
Contributor

rouming commented May 7, 2024

Do you understand what is a new state to which we transit from RESTARTING? I do not see how we can switch to HALTED from the doUpdate(), rather than these lines:

status.State = types.HALTED

	if !effectiveActivate {
		if status.Activated || status.ActivateInprogress {
			c := doInactivateHalt(ctx, config, status)
			changed = changed || c
		}
		// Activated and ActivateInprogress flags may be changed during doInactivateHalt call
		if !status.Activated && !status.ActivateInprogress {
			// Since we are not activating we set the state to
			// HALTED to indicate it is not running since it
			// might have been halted before the device was rebooted
			if status.State == types.INSTALLED || status.State == types.START_DELAYED {
				status.State = types.HALTED
				changed = true
			}
		}

so we should set INSTALLED or START_DELAYED in the doInactivateHalt().

The other branch where we can transit to HALTED is doCleanup() called from the doInactivate(), but this is not our case I assume.

@rouming
Copy link
Contributor

rouming commented May 7, 2024

Also, what is so special about PURGING? You still keep PURGING in your check.
The original commit:

commit 6bf39d2b4453f841cff799bd63292ee191a5c627
Author: eriknordmark <erik@zededa.com>
Date:   Tue Sep 25 00:01:38 2018 -0700

    do not clobber RESTARTING and PURGING states

introduced the following:

+       switch status.State {
+       case types.RESTARTING, types.PURGING:
+               // Leave unchanged
+       default:
+               status.State = minState
+       }

for these callbacks: doInstall(), doPrepare(), doActivate(), but not for the doInactivateHalt().

and then this commit followed:

commit b24e1c2cae3e8b96934b57f1527862ec00ae4d06
Author: eriknordmark <erik@zededa.com>
Date:   Mon Oct 15 16:53:43 2018 -0700

    update from DomainStatus even if Pending

which covers the doInactivateHalt() with the same hunk.

@eriknordmark do we have a special meaning for the RESTARTING,PURGING in the doInactivateHalt()?

@christoph-zededa
Copy link
Contributor Author

Do you understand what is a new state to which we transit from RESTARTING? I do not see how we can switch to HALTED from the doUpdate(), rather than these lines:

status.State = types.HALTED

	if !effectiveActivate {
		if status.Activated || status.ActivateInprogress {
			c := doInactivateHalt(ctx, config, status)
			changed = changed || c
		}
		// Activated and ActivateInprogress flags may be changed during doInactivateHalt call
		if !status.Activated && !status.ActivateInprogress {
			// Since we are not activating we set the state to
			// HALTED to indicate it is not running since it
			// might have been halted before the device was rebooted
			if status.State == types.INSTALLED || status.State == types.START_DELAYED {
				status.State = types.HALTED
				changed = true
			}
		}

so we should set INSTALLED or START_DELAYED in the doInactivateHalt().

The other branch where we can transit to HALTED is doCleanup() called from the doInactivate(), but this is not our case I assume.

doUpdate calls doInactivateHalt, which calls doInactivateHaltFromDomainStatus and this sets the status.(*types.AppInstanceStatus).State to what is in ds.(*types.DomainStatus).State) .

Does this answer your question?

@rouming
Copy link
Contributor

rouming commented May 7, 2024

Does this answer your question?

Not quite. In order to see halted on the cloud you should set it somewhere, right? I see only two places in the code (specified in the comment below). But we should not take any of this paths. So I assume HALTED comes from the domainmgr and we update the applicationstatus with what comes from the domainmgr (now you removed RESTARTING, so can be updated with any state).

But! Take a look here:

status.State = types.START_DELAYED

		// Check that we delay a not yet active VM or a VM in the bring-up state after restarting/purging
		if !status.Activated || status.RestartInprogress == types.BringUp || status.PurgeInprogress == types.BringUp {
			// If we try to activate it for the first time - mark is with the corresponding state
			if status.State != types.START_DELAYED {
				status.State = types.START_DELAYED
				return true
			}
			// if the VM is already in the START_DELAYED state - just return from the doActivate now
			return changed
		}

This is piece of code, which should eventually delay the start if we are in restart or purge.

Then eventually this code should do the final state transit to HALTED:

status.State = types.HALTED

	if !effectiveActivate {
		if status.Activated || status.ActivateInprogress {
			c := doInactivateHalt(ctx, config, status)
			changed = changed || c
		}
		// Activated and ActivateInprogress flags may be changed during doInactivateHalt call
		if !status.Activated && !status.ActivateInprogress {
			// Since we are not activating we set the state to
			// HALTED to indicate it is not running since it
			// might have been halted before the device was rebooted
			if status.State == types.INSTALLED || status.State == types.START_DELAYED {
				status.State = types.HALTED
				changed = true
			}
		}

Either this two chunks are for something different, or they never worked as expected and you simply bypass them for only RESTARTING case (PURGING seems also should be covered). That's why my question: what exact state comes from domainmgr, which you set to the applicationstatus?

@christoph-zededa
Copy link
Contributor Author

Does this answer your question?

Not quite. In order to see halted on the cloud you should set it somewhere, right? I see only two places in the code (specified in the comment below). But we should not take any of this paths. So I assume HALTED comes from the domainmgr and we update the applicationstatus with what comes from the domainmgr (now you removed RESTARTING, so can be updated with any state).

But! Take a look here:

status.State = types.START_DELAYED

		// Check that we delay a not yet active VM or a VM in the bring-up state after restarting/purging
		if !status.Activated || status.RestartInprogress == types.BringUp || status.PurgeInprogress == types.BringUp {
			// If we try to activate it for the first time - mark is with the corresponding state
			if status.State != types.START_DELAYED {
				status.State = types.START_DELAYED
				return true
			}
			// if the VM is already in the START_DELAYED state - just return from the doActivate now
			return changed
		}

This is piece of code, which should eventually delay the start if we are in restart or purge.

Then eventually this code should do the final state transit to HALTED:

status.State = types.HALTED

	if !effectiveActivate {
		if status.Activated || status.ActivateInprogress {
			c := doInactivateHalt(ctx, config, status)
			changed = changed || c
		}
		// Activated and ActivateInprogress flags may be changed during doInactivateHalt call
		if !status.Activated && !status.ActivateInprogress {
			// Since we are not activating we set the state to
			// HALTED to indicate it is not running since it
			// might have been halted before the device was rebooted
			if status.State == types.INSTALLED || status.State == types.START_DELAYED {
				status.State = types.HALTED
				changed = true
			}
		}

Either this two chunks are for something different, or they never worked as expected and you simply bypass them for only RESTARTING case (PURGING seems also should be covered). That's why my question: what exact state comes from domainmgr, which you set to the applicationstatus?

From direct discussion:
It only goes into START_DELAYED state if

if time.Now().Before(status.StartTime) {

Therefore it cannot work like this and it is something different.

@rouming
Copy link
Contributor

rouming commented May 8, 2024

Yes, I thought the START_DELAYED is an intermediate state, thanks to @OhmSpectator for explaining.

Copy link
Contributor

@rouming rouming left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This basically partially reverts the original commit:

commit b24e1c2cae3e8b96934b57f1527862ec00ae4d06
Author: eriknordmark <erik@zededa.com>
Date:   Mon Oct 15 16:53:43 2018 -0700

    update from DomainStatus even if Pending

Unfortunately not too much explanation there.

We spent quite some time understanding the states transitions. I do not see any bad side effects by removing the special check for restarting/purging and letting the transition happen. We have to be sure though, that the final state of the app has to be HALTED and we don't reboot the node somewhere in the middle, otherwise customer data can be corrupted, but @christoph-zededa verified that thoroughly.

@eriknordmark
Copy link
Contributor

@eriknordmark do we have a special meaning for the RESTARTING,PURGING in the doInactivateHalt()?

They are special in that they are pending operations and not really state like the others. But we need to track that those operations are pending somewhere. I think that can be refactored so that we set HALTING, HALTED, BOOTING in State (hence report those to the controller) and we track the pending restart/purge in some separate field in AppInstanceStatus.
[Sorry, I haven't looked at the diffs in this PR yet - just got back from vacation.)

@rouming
Copy link
Contributor

rouming commented May 14, 2024

@eriknordmark do we have a special meaning for the RESTARTING,PURGING in the doInactivateHalt()?

They are special in that they are pending operations and not really state like the others. But we need to track that those operations are pending somewhere. I think that can be refactored so that we set HALTING, HALTED, BOOTING in State (hence report those to the controller) and we track the pending restart/purge in some separate field in AppInstanceStatus. [Sorry, I haven't looked at the diffs in this PR yet - just got back from vacation.)

We already have the following fields:

status.RestartInProgress == (NotInprogress | BringUp | BringDown)
status.PurgeInProgress == (NotInprogress | BringUp | BringDown)

So this information should not be lost.

@eriknordmark
Copy link
Contributor

So this information should not be lost.

Can you all check whether there is any code which looks at the app instance status State field and does something different it is RESTARTING or PURGING (apart from the lines which avoid overwriting those values).
Or does all of the logic inside zedmanager look at RestartInProgress and PurgeInProgress? If so we can just stop using the RESTARTING and PURGING states - the reported state will then become RUNNING -> HALTING -> HALTED -> BOOTING -> RUNNING when someone does a restart or a purge.

"Ask not what you can add but what you can remove".

@christoph-zededa
Copy link
Contributor Author

Can you all check whether there is any code which looks at the app instance status State field and does something different it is RESTARTING or PURGING (apart from the lines which avoid overwriting those values). Or does all of the logic inside zedmanager look at RestartInProgress and PurgeInProgress? If so we can just stop using the RESTARTING and PURGING states - the reported state will then become RUNNING -> HALTING -> HALTED -> BOOTING -> RUNNING when someone does a restart or a purge.

"Ask not what you can add but what you can remove".

Do you mean:

  1. Only in this special case when app upgrade and eve node update are done at the same time? Then it should not use RESTARTING?
  2. In general RESTARTING should not be used?
    In this case what should we report to the controller when we do a restart?

RESTARTING is used (including reads):

 │  types.go:47 
 │  eve/pkg/pillar/cmd/zedagent/parseconfig.go:222
 │  eve/pkg/pillar/cmd/zedmanager/updatestatus.go:806
 │  eve/pkg/pillar/cmd/zedmanager/updatestatus.go:1178
 │  eve/pkg/pillar/types/types.go:101    
 │  eve/pkg/pillar/types/types.go:173         

The IMO only interesting check is in parseconfig.go:222 in countRunningApps.

same for PURGING.

@eriknordmark
Copy link
Contributor

2. In general RESTARTING should not be used?
In this case what should we report to the controller when we do a restart?

Yes, that is what I mean.

The user asked to see HALTED, which means the underlying behavior from domainmgr will report RUNNING -> HALTING -> HALTED -> BOOTING -> RUNNING as the restart proceeds. Same for purging.

@christoph-zededa
Copy link
Contributor Author

Yes, that is what I mean.

The user asked to see HALTED, which means the underlying behavior from domainmgr will report RUNNING -> HALTING -> HALTED -> BOOTING -> RUNNING as the restart proceeds. Same for purging.

Hmm, I tried, but I don't think this is the way forward:

diff --git a/pkg/pillar/cmd/zedagent/handlemetrics.go b/pkg/pillar/cmd/zedagent/handlemetrics.go
index 2704d45582..d20c6c9b64 100644
--- a/pkg/pillar/cmd/zedagent/handlemetrics.go
+++ b/pkg/pillar/cmd/zedagent/handlemetrics.go
@@ -1034,10 +1034,50 @@ func encodeProxyStatus(proxyConfig *types.ProxyConfig) *info.ProxyStatus {
        return status
 }
 
+var ignoreSendingOutFollowingRestarts map[string]struct{}
+
+func init() {
+       ignoreSendingOutFollowingRestarts = make(map[string]struct{})
+}
+
+func PublishAppInfoToZedCloud(ctx *zedagentContext, uuid string,
+       aiStatus *types.AppInstanceStatus,
+       aa *types.AssignableAdapters, iteration int, dest destinationBitset) {
+       if aiStatus.State != types.RESTARTING {
+               publishAppInfoToZedCloud(ctx, uuid, aiStatus, aa, iteration, dest)
+               delete(ignoreSendingOutFollowingRestarts, uuid)
+               return
+       }
+
+       _, found := ignoreSendingOutFollowingRestarts[uuid]
+       if found {
+               return
+       }
+
+       for _, state := range []types.SwState{types.HALTING, types.HALTED, types.BOOTING} {
+               aiStatusCopy := *aiStatus
+               aiStatusCopy.State = state
+
+               publishAppInfoToZedCloud(ctx, uuid, &aiStatusCopy, aa, iteration, dest)
+
+               // do not overwrite states, but send them out first!
+               if dest&ControllerDest != 0 {
+                       deferredCtx := zedcloudCtx.DeferredEventCtx
+                       deferredCtx.HandleDeferredNow()
+               }
+               locConfig := ctx.getconfigCtx.sideController.locConfig
+               if dest&LOCDest != 0 && locConfig != nil {
+                       zedcloudCtx.DeferredPeriodicCtx.HandleDeferredNow()
+               }
+
+               ignoreSendingOutFollowingRestarts[uuid] = struct{}{}
+       }
+}
+
 // This function is called per change, hence needs to try over all management ports
 // When aiStatus is nil it means a delete and we send a message
 // containing only the UUID to inform zedcloud about the delete.
-func PublishAppInfoToZedCloud(ctx *zedagentContext, uuid string,
+func publishAppInfoToZedCloud(ctx *zedagentContext, uuid string,
        aiStatus *types.AppInstanceStatus,
        aa *types.AssignableAdapters, iteration int, dest destinationBitset) {
        log.Functionf("PublishAppInfoToZedCloud uuid %s", uuid)
diff --git a/pkg/pillar/zedcloud/deferred.go b/pkg/pillar/zedcloud/deferred.go
index 5e2610c0c0..34039a82d2 100644
--- a/pkg/pillar/zedcloud/deferred.go
+++ b/pkg/pillar/zedcloud/deferred.go
@@ -132,6 +132,10 @@ func (ctx *DeferredContext) processQueueTask(ps *pubsub.PubSub,
        }
 }
 
+func (ctx *DeferredContext) HandleDeferredNow() {
+       ctx.handleDeferred()
+}
+

It basically checks if the state is RESTARTING and then sends out (and immediately flushes) the states HALTING, HALTED, BOOTING.

The only way that RESTARTING can be set in domainStatus.State (afaik) is in zedmanager.go under

  1224      if config.RestartCmd.Counter != oldConfig.RestartCmd.Counter ||                                                                                                   
  1225          config.LocalRestartCmd.Counter != oldConfig.LocalRestartCmd.Counter {               

@christoph-zededa
Copy link
Contributor Author

The only way that RESTARTING can be set in domainStatus.State (afaik) is in zedmanager.go under

  1224      if config.RestartCmd.Counter != oldConfig.RestartCmd.Counter ||                                                                                                   
  1225          config.LocalRestartCmd.Counter != oldConfig.LocalRestartCmd.Counter {               

@rouming pointed out that all this is not necessary and that I can just remove the

 1238              status.State = types.RESTARTING       

@eriknordmark
Copy link
Contributor

Hmm, I tried, but I don't think this is the way forward:

I'm not saying you should artificially generate those states, but instead use the existing logic in zedmanager to send those states. This implies that you'd never set the state to RESTARTING or PURGING. Thus the code can not read those but should rely on the restart/purge inprogress fields to have zedmanager go through exactly the same steps as it does today. Only visible change should be that zedagent never sees, hence doesn't send, RESTARTING or PURGING.

We can chat tomorrow if that is helpful - and note that I haven't looked at the existing code recently so I might be missing things.

@christoph-zededa christoph-zededa changed the title Allow HALT status to be sent to controller when in RESTART state Remove RESTART/PURGING states May 17, 2024
@OhmSpectator
Copy link
Member

I will retest the PR manually... I wanna be sure it does not break my stuff.

@OhmSpectator
Copy link
Member

I've tested what I wanted, and it looks fine...

But we should also announce this change loud, I believe:

  1. The test team may have some tests that expect the states as they are now.
  2. As it's seen in the UI in the events flow, we should inform users about the change. Users may get used to the current events flow and expect this to remain the same. In the worst case, it can break some of their automation.

@eriknordmark
Copy link
Contributor

I've tested what I wanted, and it looks fine...

But we should also announce this change loud, I believe:

  1. The test team may have some tests that expect the states as they are now.
  2. As it's seen in the UI in the events flow, we should inform users about the change. Users may get used to the current events flow and expect this to remain the same. In the worst case, it can break some of their automation.

In addition to testing and annoucing, please check the docs we have in the eve repo and check with the ZEDEDA commercial controller documentation if there are things which need to be updated there. (We don't have documentation at this level of detail for Adam/Eden so no need to check there.)

Copy link
Contributor

@eriknordmark eriknordmark left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

state transition from RESTARTING to HALT(ING)

if an edge app upgrade is done while an EVE update is done,
the app status state stays in RESTARTING although EVE
will restart and therefore halt the app instance; therefore
the app status should go to HALT in this case.

Signed-off-by: Christoph Ostarek <christoph@zededa.com>
'lock' is used for locking deferredItems, so let's
clarify this!

Signed-off-by: Christoph Ostarek <christoph@zededa.com>
include removing dead code

Signed-off-by: Christoph Ostarek <christoph@zededa.com>

Signed-off-by: Christoph Ostarek <christoph@zededa.com>
doing a purge; otherwise HALTED state would be reported
directly

Signed-off-by: Christoph Ostarek <christoph@zededa.com>
@christoph-zededa
Copy link
Contributor Author

In addition to testing and annoucing, please check the docs we have in the eve repo and check with the ZEDEDA commercial controller documentation if there are things which need to be updated there. (We don't have documentation at this level of detail for Adam/Eden so no need to check there.)

I did not find anything regarding this yet (one thing I still have to check for the commercial controller).

I found that some tests/eden rely on RESTARTING or PURGING state, so I changed that to HALTING (see top commit), but I am not sure about the impact; the log files of the failed eden tests don't complain about those two states.

@christoph-zededa christoph-zededa changed the title Remove RESTART/PURGING states Remove RESTARTING/PURGING states May 27, 2024
@eriknordmark
Copy link
Contributor

I found that some tests/eden rely on RESTARTING or PURGING state, so I changed that to HALTING (see top commit), but I am not sure about the impact; the log files of the failed eden tests don't complain about those two states.

The changes to eden tests make sense, but need to go as a PR to the eden repo.

@christoph-zededa
Copy link
Contributor Author

I found that some tests/eden rely on RESTARTING or PURGING state, so I changed that to HALTING (see top commit), but I am not sure about the impact; the log files of the failed eden tests don't complain about those two states.

The changes to eden tests make sense, but need to go as a PR to the eden repo.

PR is here: https://github.com/lf-edge/eden/pull/970/files - I tried it out locally with one of the tests and it fixes it

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants